This workshop focuses on tools for spatial computation in R. Topics covered are:
Spatial computation refers to data wrangling with spatial data. It enables the construction of new datasets and measures by joining otherwise unlinked data across space or by measuring the space between objects.
Useful references:
Note that this workshop does not focus on spatial statistics. Spatial statistics refers to a set of statistical models that account for spatial autocorrelation. Many analyses that require spatial statistics involve very little spatial computation, and many analyses that rely on spatial computation do not require any spatial statistics.
This article is a good introduction to spatial statistics, if people want to learn more about it. Also, in a previous iteration of this workshop, my friend and colleague Chris Hess did a second half of this workshop about spatial statistics, which you can find here.
Research Question: Do police shootings affect election turnout in nearby neighborhoods?
Data:
Methods:
Results
Police shootings lead to increased turnout in subsequent elections, but only in the same Census block where the shooting occurs.
Research Question: To what extent is America geographically segregated by partisanship?
Data: National voter registration database with 180 million registered voters
Methods:
Results:
Extremely granular measure of partisan segregation across the country.
Research Question: How should we measure voter turnout by race/ethnicity at the precinct level?
Data: Voter registration records following the 2018 Georgia Gubernatorial election
Methods: Compare two spatial computation approaches
Results:
BISG is more accurate as long as precincts are sufficiently homogeneous. In more diverse precincts, bias from BISG gets worse than the noise from interpolation.
# Load packages
suppressPackageStartupMessages({
library(tidyverse)
library(tidycensus)
library(sfheaders)
library(patchwork)
library(sf)
library(censusxy)
library(spatialreg)
library(spdep)
library(mgcv)
})
# Set plot theme
plot_theme <-
theme_void() +
theme(axis.text = element_text(size = 12),
axis.title = element_text(size = 18, face = "bold"),
axis.title.x = element_text(margin = margin(t = 12)),
axis.title.y = element_text(margin = margin(r = 12)),
plot.title = element_text(size = 18, face = "bold", hjust = 0.5))
sf stands for simple features. Simple features are a formal standard through which real-world objects can be represented by computers. It is implemented by most spatial data software and databases, including probably all the spatial software you've heard of: ArcGIS, GDAL, etc.
Because sf is a standard that underlies all of these tools, users can basically ignore filetypes or application-specific restrictions. If it’s spatial data, you can probably read it in with read_sf() / st_read().
R has two toolkits for spatial data analysis: sf and sp. sp is older and better established, and many people are more familiar with it. That said, I recommend learning sf instead of sp.
The Geocomputation with R text lists the following advantages:
Moreover, from the sf documentation, it “aims at succeeding sp in the long term”. The list of tasks for which sp outperforms sf is dwindling, and will eventually be zero.
# Read in data
sf_precincts <- read_sf("data/fulton_gwinnett.gpkg")
class(sf_precincts)
## [1] "sf" "tbl_df" "tbl" "data.frame"
head(sf_precincts)
## Simple feature collection with 6 features and 16 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -12511.88 ymin: 412311.1 xmax: -4226.666 ymax: 436447.4
## Projected CRS: NAD83(2011) / Georgia East
## # A tibble: 6 × 17
## precinct_id_2018 county abrams_prop kemp_prop metz_prop total_votes
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Fulton,08P Fulton 0.831 0.159 0.00974 924
## 2 Fulton,Ss09B Fulton 0.459 0.526 0.0143 2445
## 3 Fulton,03A Fulton 0.981 0.0145 0.00435 690
## 4 Fulton,07J Fulton 0.647 0.334 0.0195 1745
## 5 Fulton,09E Fulton 0.946 0.0473 0.00636 1416
## 6 Fulton,12A Fulton 0.957 0.0370 0.00586 2731
## # … with 11 more variables: whi_true_total <dbl>, bla_true_total <dbl>,
## # his_true_total <dbl>, asi_true_total <dbl>, oth_true_total <dbl>,
## # whi_true <dbl>, bla_true <dbl>, his_true <dbl>, asi_true <dbl>,
## # oth_true <dbl>, geom <MULTIPOLYGON [m]>
The sf object behaves exactly like a tibble, except it has one extra column, geom, containing spatial information. Aside from messing with the geom column, you can pretty much treat the sf object like a data frame. For example, I can get the margin of victory for Stacey Abrams over Brian Kemp in large precincts using basic dplyr verbs.
margin_large_precincts <- sf_precincts %>%
filter(total_votes > 4000) %>%
select(precinct_id_2018, county, ends_with("prop"), total_votes) %>%
mutate(abrams_pct_margin = (abrams_prop - kemp_prop)*total_votes)
class(margin_large_precincts)
## [1] "sf" "tbl_df" "tbl" "data.frame"
margin_large_precincts
## Simple feature collection with 6 features and 7 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -30873.14 ymin: 397015.3 xmax: 324.0863 ymax: 458847.4
## Projected CRS: NAD83(2011) / Georgia East
## # A tibble: 6 × 8
## precinct_id_2018 county abrams_prop kemp_prop metz_prop total_votes
## * <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Fulton,Sc15 Fulton 0.971 0.0278 0.00166 4209
## 2 Fulton,07A Fulton 0.567 0.422 0.0118 4077
## 3 Fulton,Uc02A Fulton 0.971 0.0268 0.00233 4287
## 4 Fulton,06D Fulton 0.802 0.186 0.0119 6046
## 5 Fulton,Rw09 Fulton 0.353 0.632 0.0143 4205
## 6 Fulton,Rw12 Fulton 0.401 0.581 0.0183 4218
## # … with 2 more variables: geom <MULTIPOLYGON [m]>, abrams_pct_margin <dbl>
Some operations will revert the object to a plain data frame. A common case is a join: the result takes the class of the first object entered into the join, which can cause the sf class to get dropped.
Below I try to add a FIPS code to the sf object with a join. If I put the plain data frame first, I lose the sf class; I have to do it the other way around.
countyfips <- data.frame(
"county" = c("Fulton", "Gwinnett"),
"FIPS" = c("13121", "13135")
)
sf_georgia_w_fips = right_join(countyfips, sf_precincts)
## Joining, by = "county"
class(sf_georgia_w_fips)
## [1] "data.frame"
sf_georgia_w_fips = right_join(sf_precincts, countyfips)
## Joining, by = "county"
class(sf_georgia_w_fips)
## [1] "sf" "tbl_df" "tbl" "data.frame"
What’s going on in the special geometry column?
sf_precincts %>%
select(geom) %>%
head()
## Simple feature collection with 6 features and 0 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -12511.88 ymin: 412311.1 xmax: -4226.666 ymax: 436447.4
## Projected CRS: NAD83(2011) / Georgia East
## # A tibble: 6 × 1
## geom
## <MULTIPOLYGON [m]>
## 1 (((-6803.698 422854.3, -6879.647 422779.3, -6904.589 422757.3, -6972.799 4226…
## 2 (((-5618.893 432298.6, -5605.268 432303.6, -5590.276 432307.6, -5574.961 4323…
## 3 (((-9733.036 419134, -9729.344 419120.6, -9722.918 419075.5, -9720.44 419053.…
## 4 (((-5155.174 425017.6, -5151.01 425008.7, -5140.385 424985.8, -5132.75 424974…
## 5 (((-10372.33 420976.9, -10575.86 420923.2, -10635.16 420915.8, -10607.59 4209…
## 6 (((-9778.072 412374.1, -9769.025 412456.7, -9764.948 412507.3, -9758.865 4125…
The geom column contains a vector of multidimensional points describing the geographic shape. There are several geometry types, the full list of which can be found in the texts above. In my experience, most social science work involves POINT, POLYGON and MULTIPOLYGON types.
Points are just points
points <- rbind(c(1,1), c(1, 2), c(2, 1), c(2, 2))
points <- st_multipoint(points)
plot(points)
Polygons are built up of linestrings
polygon_strings <- rbind(c(1,1), c(1, 2), c(2, 2), c(2, 1), c(1, 1))
polygon_list = list(polygon_strings)
plot(st_polygon(polygon_list))
Multipolygons are collections of polygons. This enables describing multiple separate polygons within the same row of an sf object. They would be the right type if, for example, you wanted to plot Japan alongside other countries. (Polygons can have holes too, via interior rings; what multipolygons add is multiple disjoint outer rings.)
polygon_strings_2 <- rbind(c(3, 1), c(3, 2), c(4, 2), c(4, 1), c(3, 1))
multipolygon <- st_multipolygon(list(list(polygon_strings), list(polygon_strings_2)))
plot(multipolygon)
In general, R will take care of defining the geometries for you. Usually you’ll know to expect either points or polygons based on the nature of your data. For instance, traffic stops or schools will probably be stored as points, while school districts or election precincts will be polygons.
Also in general, you might find polygons and multipolygons being used interchangeably. In this dataset all the precincts are labelled as multipolygons even though most of them could be polygons. As far as I know, this is fine. Multipolygon accepts a broader range of shapes and won’t break if two precincts happen to be separated.
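If you ever need to standardize geometry types yourself, st_cast() converts between them. A minimal sketch on a toy square (not the precinct data):

```r
library(sf)

# A plain polygon: one exterior ring
square <- st_polygon(list(rbind(c(0, 0), c(0, 1), c(1, 1), c(1, 0), c(0, 0))))
st_geometry_type(square)  # POLYGON

# Promote it to a multipolygon; the shape itself is unchanged
square_mp <- st_cast(square, "MULTIPOLYGON")
st_geometry_type(square_mp)  # MULTIPOLYGON
```

On a full sf object, st_cast(sf_precincts, "MULTIPOLYGON") would make every row's type consistent.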
What exactly are the numbers? These depend on your coordinate reference system, which we discuss below.
sf is well-integrated with ggplot. This makes map-construction in R pretty easy, at least for simple plots.
ggplot(sf_precincts) +
geom_sf(color = "white", aes(fill = county)) +
plot_theme
geom_sf() behaves much like other plot types. You can set the fills equal to aesthetics easily. Let’s see where votes for Stacey Abrams are concentrated…
ggplot(sf_precincts) +
geom_sf(color = "NA", aes(fill = abrams_prop)) +
plot_theme
To visualize the borders between the counties, we can take advantage of the fact that summarize dissolves geographic areas.
sf_counties <- sf_precincts %>%
group_by(county) %>%
summarize() %>%
nngeo::st_remove_holes()
head(sf_counties)
## Simple feature collection with 2 features and 1 field
## Geometry type: GEOMETRY
## Dimension: XY
## Bounding box: xmin: -49386.5 ymin: 391000.8 xmax: 49071.09 ymax: 466262
## Projected CRS: NAD83(2011) / Georgia East
## county geometry
## 1 Fulton MULTIPOLYGON (((-36385.64 3...
## 2 Gwinnett POLYGON ((17524.69 427185.8...
ggplot() +
geom_sf(data = sf_precincts, color = "NA", aes(fill = abrams_prop)) +
geom_sf(data = sf_counties, color = 'grey', size = 1.5, fill = "NA") +
plot_theme
When working with data in the US, it's helpful to know that the tidycensus package can return sf objects when you use it to query the API. For instance, what if we wanted to place these counties within the state of Georgia?
# Note that to run this yourself you will need to set up an API key. See here:
# https://walker-data.com/tidycensus/articles/basic-usage.html
sf_georgia <- get_acs(
geography = "state",
year = 2018,
variables = "B00001_001",
state = "Georgia",
geometry = TRUE
)
## Getting data from the 2014-2018 5-year ACS
## Downloading feature geometry from the Census website. To cache shapefiles for use in future sessions, set `options(tigris_use_cache = TRUE)`.
sf_georgia
## Simple feature collection with 1 feature and 4 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -85.60516 ymin: 30.35785 xmax: -80.83973 ymax: 35.00066
## Geodetic CRS: NAD83
## GEOID NAME variable estimate geometry
## 1 13 Georgia B00001_001 651000 MULTIPOLYGON (((-81.27939 3...
ggplot() +
geom_sf(data = sf_precincts, color = "NA", fill = "black") +
geom_sf(data = sf_georgia, color = "black", fill = "NA") +
plot_theme
TL;DR: the CRS defines the units of your points.
The two-dimensional plane upon which the above maps are plotted makes the units of maps seem straightforward, when in reality they are not. We have known for some time that the earth is not a flat two-dimensional plane. It is round, more of an ellipsoid than a sphere, and covered with peaks and valleys.
CRSs are different measurement systems with different strengths and weaknesses when it comes to handling the complexity of the Earth's shape. All of them carry some risk of inaccuracy, and this can matter if you are doing highly precise spatial operations. There are lots of different CRSs, which you can read about here.
A CRS can be 'geographic' or 'projected'. A geographic CRS retains the true shape of the earth, namely that it is round. A projected CRS represents the earth's surface in two dimensions, which makes plotting feasible.
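To see the difference concretely, here is a small sketch: a single longitude/latitude point in the geographic CRS WGS84 (EPSG 4326), re-projected into the projected CRS used later in this workshop (EPSG 6444), which switches the units from degrees to meters. The coordinates are a made-up point near Atlanta.

```r
library(sf)

# A point in a geographic CRS: units are degrees of longitude/latitude
pt_geo <- st_sfc(st_point(c(-84.39, 33.75)), crs = 4326)
st_is_longlat(pt_geo)  # TRUE

# Re-project into NAD83(2011) / Georgia East (EPSG 6444): units are meters
pt_proj <- st_transform(pt_geo, 6444)
st_is_longlat(pt_proj)  # FALSE
```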
sf objects typically come with a preassigned CRS:
st_crs(sf_precincts)
## Coordinate Reference System:
## User input: NAD83(2011) / Georgia East
## wkt:
## PROJCRS["NAD83(2011) / Georgia East",
## BASEGEOGCRS["NAD83(2011)",
## DATUM["NAD83 (National Spatial Reference System 2011)",
## ELLIPSOID["GRS 1980",6378137,298.257222101,
## LENGTHUNIT["metre",1]]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433]],
## ID["EPSG",6318]],
## CONVERSION["SPCS83 Georgia East zone (meters)",
## METHOD["Transverse Mercator",
## ID["EPSG",9807]],
## PARAMETER["Latitude of natural origin",30,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8801]],
## PARAMETER["Longitude of natural origin",-82.1666666666667,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8802]],
## PARAMETER["Scale factor at natural origin",0.9999,
## SCALEUNIT["unity",1],
## ID["EPSG",8805]],
## PARAMETER["False easting",200000,
## LENGTHUNIT["metre",1],
## ID["EPSG",8806]],
## PARAMETER["False northing",0,
## LENGTHUNIT["metre",1],
## ID["EPSG",8807]]],
## CS[Cartesian,2],
## AXIS["easting (X)",east,
## ORDER[1],
## LENGTHUNIT["metre",1]],
## AXIS["northing (Y)",north,
## ORDER[2],
## LENGTHUNIT["metre",1]],
## USAGE[
## SCOPE["Engineering survey, topographic mapping."],
## AREA["United States (USA) - Georgia - counties of Appling; Atkinson; Bacon; Baldwin; Brantley; Bryan; Bulloch; Burke; Camden; Candler; Charlton; Chatham; Clinch; Coffee; Columbia; Dodge; Echols; Effingham; Elbert; Emanuel; Evans; Franklin; Glascock; Glynn; Greene; Hancock; Hart; Jeff Davis; Jefferson; Jenkins; Johnson; Lanier; Laurens; Liberty; Lincoln; Long; Madison; McDuffie; McIntosh; Montgomery; Oglethorpe; Pierce; Richmond; Screven; Stephens; Taliaferro; Tattnall; Telfair; Toombs; Treutlen; Ware; Warren; Washington; Wayne; Wheeler; Wilkes; Wilkinson."],
## BBOX[30.36,-83.47,34.68,-80.77]],
## ID["EPSG",6444]]
This is a CRS I selected. I don't understand most of this description. I do know this is part of a series of NAD83 projected CRSs. NAD83 is one of the two most popular datums; the other is WGS84. WGS84 is what's called a geocentric datum, meaning its center is the center of the Earth and it is agnostic to abnormalities on the surface of the Earth. In contrast, NAD83 is a local datum, meaning it is optimized for use in a particular region. I picked one that is optimized for use in Georgia.
This CRS was set by running st_transform(sf_precincts, 6444). 6444 is the EPSG code for the CRS. EPSG codes are the easy way to specify CRSs in R. The other way is to write a proj4string, which involves writing out the specifics of the CRS in detail. I've never done this before.
CRSs and projections are a very complicated topic, and there are a lot of options available. It's easy to get bogged down in this. Probably the best approach is to find a CRS/projection combination that works well for your data and be consistent about applying it. As long as you do this, the remainder of your workflow should go smoothly. You can see a list of the available EPSG codes using rgdal::make_EPSG().
crs_data <- rgdal::make_EPSG()
head(crs_data)
## code note
## 1 2000 Anguilla 1957 / British West Indies Grid
## 2 2001 Antigua 1943 / British West Indies Grid
## 3 2002 Dominica 1945 / British West Indies Grid
## 4 2003 Grenada 1953 / British West Indies Grid
## 5 2004 Montserrat 1958 / British West Indies Grid
## 6 2005 St. Kitts 1955 / British West Indies Grid
## prj4
## 1 +proj=tmerc +lat_0=0 +lon_0=-62 +k=0.9995 +x_0=400000 +y_0=0 +a=6378249.145 +rf=293.465 +units=m +no_defs +type=crs
## 2 +proj=tmerc +lat_0=0 +lon_0=-62 +k=0.9995 +x_0=400000 +y_0=0 +a=6378249.145 +rf=293.465 +units=m +no_defs +type=crs
## 3 +proj=tmerc +lat_0=0 +lon_0=-62 +k=0.9995 +x_0=400000 +y_0=0 +a=6378249.145 +rf=293.465 +units=m +no_defs +type=crs
## 4 +proj=tmerc +lat_0=0 +lon_0=-62 +k=0.9995 +x_0=400000 +y_0=0 +a=6378249.145 +rf=293.465 +units=m +no_defs +type=crs
## 5 +proj=tmerc +lat_0=0 +lon_0=-62 +k=0.9995 +x_0=400000 +y_0=0 +a=6378249.145 +rf=293.465 +units=m +no_defs +type=crs
## 6 +proj=tmerc +lat_0=0 +lon_0=-62 +k=0.9995 +x_0=400000 +y_0=0 +a=6378249.145 +rf=293.465 +units=m +no_defs +type=crs
## prj_method
## 1 Transverse Mercator
## 2 Transverse Mercator
## 3 Transverse Mercator
## 4 Transverse Mercator
## 5 Transverse Mercator
## 6 Transverse Mercator
crs_data %>%
filter(code == 6444)
## code note
## 1 6444 NAD83(2011) / Georgia East
## prj4
## 1 +proj=tmerc +lat_0=30 +lon_0=-82.1666666666667 +k=0.9999 +x_0=200000 +y_0=0 +ellps=GRS80 +units=m +no_defs +type=crs
## prj_method
## 1 Transverse Mercator
Lots of projects involving spatial analyses begin from address data. Addresses must be converted into points via geocoding. There are many geocoding services available and the options are always changing. For the most part, this is a commercial market. Some common alternatives include:
The Census geocoder (accessed in R via the censusxy package) is generally very easy to use. I won't do a full demo because I don't want to make a ton of addresses from a real dataset public. Here's a quick demo with one address line.
geo <- cxy_single(street = "171 East State Street", city = "Ithaca", state = "NY", zip = 14850)
geo
## tigerLine.side tigerLine.tigerLineId coordinates.x coordinates.y
## 1 R 18857717 -76.49751 42.43958
## addressComponents.zip addressComponents.streetName addressComponents.preType
## 1 14850 STATE
## addressComponents.city addressComponents.preDirection
## 1 ITHACA E
## addressComponents.suffixDirection addressComponents.fromAddress
## 1 101
## addressComponents.state addressComponents.suffixType
## 1 NY ST
## addressComponents.toAddress addressComponents.suffixQualifier
## 1 199
## addressComponents.preQualifier matchedAddress
## 1 171 E STATE ST, ITHACA, NY, 14850
Then to convert a geocoded address to an sf object:
geo_sf <- st_as_sf(geo, coords = c('coordinates.x', 'coordinates.y'))
head(geo_sf)
## Simple feature collection with 1 feature and 15 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -76.49751 ymin: 42.43958 xmax: -76.49751 ymax: 42.43958
## CRS: NA
## tigerLine.side tigerLine.tigerLineId addressComponents.zip
## 1 R 18857717 14850
## addressComponents.streetName addressComponents.preType addressComponents.city
## 1 STATE ITHACA
## addressComponents.preDirection addressComponents.suffixDirection
## 1 E
## addressComponents.fromAddress addressComponents.state
## 1 101 NY
## addressComponents.suffixType addressComponents.toAddress
## 1 ST 199
## addressComponents.suffixQualifier addressComponents.preQualifier
## 1
## matchedAddress geometry
## 1 171 E STATE ST, ITHACA, NY, 14850 POINT (-76.49751 42.43958)
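Note the CRS: NA in the printout above: st_as_sf() has no way of knowing what system the coordinates are in, so before combining this point with other layers you should assign one. To my understanding the Census geocoder returns NAD83 longitude/latitude, so EPSG 4269 is a reasonable assumption; a sketch:

```r
library(sf)

# st_set_crs() declares the CRS the coordinates were produced in
# (it does NOT transform them). EPSG 4269 = NAD83 longitude/latitude,
# which is my assumption about the Census geocoder's output.
geo_sf <- st_set_crs(geo_sf, 4269)

# Now transformations into other systems are possible
geo_sf_projected <- st_transform(geo_sf, 6444)
```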
Next, we’ll work through examples of common spatial operations. A bunch of them require some point data, so we’ll use this dataset with point-located sports venues throughout Georgia.
sf_sports <- read_sf("data/sports_venues.gpkg")
head(sf_sports)
## Simple feature collection with 6 features and 27 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -9394569 ymin: 3650717 xmax: -9062026 ymax: 4022064
## Projected CRS: WGS 84 / Pseudo-Mercator
## # A tibble: 6 × 28
## OBJECTID VENUEID NAME ADDRESS CITY STATE ZIP ZIP4 TELEPHONE TYPE STATUS
## <int> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 72 99 BOBBY… 155 NO… ATLA… GA 30332 NOT … (404) 89… SING… OPEN
## 2 73 100 SANFO… SANFOR… ATHE… GA 30605 NOT … (706) 54… SING… OPEN
## 3 204 240 GEORG… 755 HA… ATLA… GA 30315 NOT … (404) 41… SING… OPEN
## 4 239 276 EAST … 2575 A… ATLA… GA 30317 NOT … (404) 37… SING… OPEN
## 5 240 277 AUGUS… 807 EI… AUGU… GA 30904 NOT … (706) 73… SING… OPEN
## 6 241 278 SEA I… 100 RE… ST S… GA 31522 NOT … (800) 73… SING… OPEN
## # … with 17 more variables: POPULATION <dbl>, COUNTY <chr>, COUNTYFIPS <chr>,
## # COUNTRY <chr>, LATITUDE <dbl>, LONGITUDE <dbl>, NAICS_CODE <chr>,
## # NAICS_DESC <chr>, SOURCE <chr>, SOURCEDATE <date>, VAL_METHOD <chr>,
## # VAL_DATE <date>, NAME2 <chr>, NAME3 <chr>, FORMERNM <chr>, YEARCHNG <chr>,
## # geom <POINT [m]>
If we want to use this sf object together with our election precincts one, we first need to line up their CRS.
sf_sports <- st_transform(sf_sports, st_crs(sf_precincts))
st_crs(sf_sports) == st_crs(sf_precincts)
## [1] TRUE
Where are these sports stadiums?
sf_georgia <- st_transform(sf_georgia, st_crs(sf_precincts))
ggplot() +
geom_sf(data = sf_precincts, color = "NA", fill = "black") +
geom_sf(data = sf_georgia, color = "black", fill = "NA") +
geom_sf(data = sf_sports, color = "red", size = 2) +
plot_theme
We can get rid of the ones not inside the counties we care about using a filter.
sf_sports_gf <- st_filter(sf_sports, sf_precincts)
ggplot() +
geom_sf(data = sf_precincts, color = "NA", fill = "black") +
geom_sf(data = sf_georgia, color = "black", fill = "NA") +
geom_sf(data = sf_sports, color = "red", size = 2) +
geom_sf(data = sf_sports_gf, color = "green", size = 4) +
plot_theme
At this point it's still going to be hard to do any analysis because the information we're interested in is split across two objects. The sports stadiums are in sf_sports, and the precincts are in sf_precincts. If we want to analyze the effect of having a sports stadium on a precinct's election results (we don't, but suppose we did), we need a variable in the sf_precincts object that says whether or not a precinct contains a sports stadium.
ggplot() +
geom_sf(data = sf_precincts, color = "grey", fill = "white") +
geom_sf(data = sf_sports_gf, color = "red", size = 1) +
plot_theme
To get them together, we can do a spatial join:
sf_precincts$n_stadiums <- lengths(st_intersects(sf_precincts, sf_sports_gf))
ggplot() +
geom_sf(data = sf_precincts, color = "grey", aes(fill = as.factor(n_stadiums))) +
geom_sf(data = sf_sports_gf, color = "red", size = 1) +
scale_fill_brewer(type = "seq") +
plot_theme
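lengths(st_intersects()) gives a count per precinct. If you instead want to pull the stadiums' attributes (names, types, etc.) onto the precincts, st_join() performs a full spatial join; a sketch, with the caveat that a precinct containing multiple stadiums will be duplicated, one row per match:

```r
library(sf)

# Left spatial join: every precinct is kept; precincts with no
# stadium get NA in the stadium columns
sf_precincts_stadiums <- st_join(sf_precincts, sf_sports_gf, join = st_intersects)
```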
Now let’s try computing the distance from each precinct to the nearest stadium. The first problem we face is that we can theoretically choose any point within a polygon from which to compute a distance.
Typically people use centroids for this. You can think of centroids as points at the middle of the polygon.
In actuality it's a bit more complicated than that. The standard centroid is the polygon's geometric center of mass. However, you can imagine situations where this point isn't even inside the intended polygon, like if the polygon is L-shaped or horseshoe-shaped. There are various other formulations that modify the basic computation to avoid this problem, and which one you choose may affect results in some instances. In most datasets, the choice of centroid computation probably matters for some small (but maybe important) subset of the shapes.
st_centroid() computes the centroid as described above. st_point_on_surface() computes a variation that ensures the returned point is within the area's boundaries.
sf_precinct_centroids <- st_centroid(sf_precincts)
## Warning in st_centroid.sf(sf_precincts): st_centroid assumes attributes are
## constant over geometries of x
sf_precinct_centsonsurface <- st_point_on_surface(sf_precincts)
## Warning in st_point_on_surface.sf(sf_precincts): st_point_on_surface assumes
## attributes are constant over geometries of x
ggplot() +
geom_sf(data = sf_precincts, color = "grey", fill = "white") +
geom_sf(data = sf_precinct_centroids, color = "red", size = 1) +
geom_sf(data = sf_precinct_centsonsurface, color = "blue", size = 1) +
plot_theme
Getting distance to the nearest stadium: st_nearest_feature() takes in two sf objects and returns a vector of row indices from the second object that are closest to each shape in the first.
nearest <- st_nearest_feature(sf_precinct_centroids, sf_sports)
nearest[1:10]
## [1] 9 19 18 9 9 3 3 18 9 17
So the first precinct centroid is closest to the stadium in the ninth row of the sports object, and so on. We can do some old-fashioned indexing to get the distances between these points, then add them back to sf_precincts.
dist_to_nearest <- st_distance(sf_precinct_centroids, sf_sports[nearest,], by_element = TRUE)
sf_precincts$dist_to_nearest <- as.numeric(dist_to_nearest)
ggplot() +
geom_sf(data = sf_precincts, color = "NA", aes(fill = dist_to_nearest/1000)) +
geom_sf(data = sf_sports_gf, color = "red", size = 1) +
scale_fill_gradient2(low = "black", mid = "grey10", high = "white") +
plot_theme
Sometimes you want a measure in one geographic unit, but you only have it in another. This is like the paper above. We can use areal-weighted interpolation to crudely transfer the measure from one spatial unit to another.
Interpolation is crude because it relies on the assumption that whatever characteristic of an area you're interpolating is spread evenly across the initial geographic surface. This is a very strong assumption that almost never holds; in this case, the populations of different race groups are most certainly not spread evenly across census areas. Nonetheless, this tool can be useful when there are no good alternatives.
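To make that assumption concrete, here is roughly what areal-weighted interpolation does for an extensive variable (a count), sketched by hand with st_intersection() on toy data: one "block group" with 100 people straddles two "precincts" evenly, so each precinct is allocated half the count. aw_interpolate() handles many edge cases this sketch ignores.

```r
library(sf)

# Source: one rectangular block group holding 100 people
src <- st_sf(
  pop = 100,
  geometry = st_sfc(st_polygon(list(rbind(
    c(0, 0), c(0, 1), c(2, 1), c(2, 0), c(0, 0)
  ))))
)

# Targets: two precincts, each covering half of the source area
targets <- st_sf(
  precinct = c("A", "B"),
  geometry = st_sfc(
    st_polygon(list(rbind(c(0, 0), c(0, 1), c(1, 1), c(1, 0), c(0, 0)))),
    st_polygon(list(rbind(c(1, 0), c(1, 1), c(2, 1), c(2, 0), c(1, 0))))
  )
)

# Weight the count by each target's share of the source area, then
# the weighted counts are what interpolation reports per target
pieces <- st_intersection(targets, src)
pieces$weight <- as.numeric(st_area(pieces)) / as.numeric(st_area(src))
pieces$pop_interp <- pieces$pop * pieces$weight
st_drop_geometry(pieces)[, c("precinct", "pop_interp")]  # 50 people each
```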
Here we’re going to grab data on the voting-aged populations of different race groups at the block-group level and interpolate this over to the election precincts. We could use Census blocks, but it would be slower.
georgia_counties <- c("121", "135")
sf_vap <- purrr::map_dfr(
georgia_counties,
~ tidycensus::get_decennial(
geography = "block group",
variables = c(
"P011002", # total hispanic, 18+
"P011005", # total white, 18+
"P011006", # total black, 18+
"P011008" # total asian, 18+
),
year = 2010,
summary_var = "P011001", # total population, 18+,
state = "GA",
county = .,
geometry = TRUE,
cache = TRUE,
output = "wide"
)
)
## Getting data from the 2010 decennial Census
## Downloading feature geometry from the Census website. To cache shapefiles for use in future sessions, set `options(tigris_use_cache = TRUE)`.
## Using Census Summary File 1
## Getting data from the 2010 decennial Census
## Downloading feature geometry from the Census website. To cache shapefiles for use in future sessions, set `options(tigris_use_cache = TRUE)`.
## Using Census Summary File 1
sf_vap <- sf::st_transform(sf_vap, sf::st_crs(sf_precincts))
head(sf_vap)
## Simple feature collection with 6 features and 7 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -8255.715 ymin: 421312 xmax: -2039.175 ymax: 424103.3
## Projected CRS: NAD83(2011) / Georgia East
## # A tibble: 6 × 8
## GEOID NAME P011002 P011005 P011006 P011008 summary_value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 131210001001 Block Group 1, Cen… 13 578 8 11 617
## 2 131210002001 Block Group 1, Cen… 43 779 100 30 969
## 3 131210002005 Block Group 5, Cen… 30 653 25 23 745
## 4 131210006001 Block Group 1, Cen… 243 2289 1170 1113 4970
## 5 131210010011 Block Group 1, Cen… 65 519 85 741 1471
## 6 131210011001 Block Group 1, Cen… 57 784 114 72 1052
## # … with 1 more variable: geometry <MULTIPOLYGON [m]>
We can see that the two maps have different boundary lines
ggplot() +
geom_sf(data = sf_precincts, fill = NA, color = 'red', size = .2) +
geom_sf(data = sf_vap, fill = NA, color = 'blue', size = .2) +
plot_theme
We can interpolate from one dataset to the other. There is a built-in areal interpolation function in sf (st_interpolate_aw()), but I have found the version from the areal package tends to perform better, at least for this problem.
sf_precincts <- areal::aw_interpolate(
sf_precincts,
tid = precinct_id_2018,
source = sf_vap,
sid = GEOID,
output = "sf",
weight = "sum",
extensive = c(
"P011002", "P011005",
"P011006", "P011008", "summary_value"
)
)
Here we do some processing to construct measures of racial composition.
sf_precincts <- sf_precincts %>%
dplyr::rename(
"whi_2010_vap_total" = "P011005",
"bla_2010_vap_total" = "P011006",
"his_2010_vap_total" = "P011002",
"asi_2010_vap_total" = "P011008",
"vap_total" = "summary_value"
)
sf_precincts$oth_2010_vap_total <- sf_precincts$vap_total -
sf_precincts$whi_2010_vap_total -
sf_precincts$bla_2010_vap_total -
sf_precincts$his_2010_vap_total -
sf_precincts$asi_2010_vap_total
sf_precincts <- sf_precincts %>%
dplyr::mutate(oth_2010_vap_total = ifelse(
oth_2010_vap_total < 0,
0,
oth_2010_vap_total
)) %>%
dplyr::mutate(
"whi_2010_vap_prop" = ifelse(vap_total == 0, 0, whi_2010_vap_total / vap_total),
"bla_2010_vap_prop" = ifelse(vap_total == 0, 0, bla_2010_vap_total / vap_total),
"his_2010_vap_prop" = ifelse(vap_total == 0, 0, his_2010_vap_total / vap_total),
"asi_2010_vap_prop" = ifelse(vap_total == 0, 0, asi_2010_vap_total / vap_total),
"oth_2010_vap_prop" = ifelse(vap_total == 0, 0, oth_2010_vap_total / vap_total)
)
Now we can compare the composition estimates from interpolation to the actual self-reported race estimates.
p1 <- ggplot(sf_precincts) +
geom_sf(aes(fill = whi_true)) +
scale_fill_gradient2(low = "black", mid = "grey10", high = "white") +
plot_theme +
ggtitle("Self-Reported") +
theme(legend.position = "None",
axis.text = element_blank())
p2 <- ggplot(sf_precincts) +
geom_sf(aes(fill = whi_2010_vap_prop)) +
scale_fill_gradient2(low = "black", mid = "grey10", high = "white") +
plot_theme +
ggtitle("Interpolated") +
theme(legend.position = "None",
axis.text = element_blank())
p1 + p2